{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "19d53049",
   "metadata": {},
   "source": [
    "# 情感分析及数据集\n",
    ":label:`sec_sentiment`\n",
    "\n",
    "随着在线社交媒体和评论平台的快速发展，大量评论的数据被记录下来。这些数据具有支持决策过程的巨大潜力。\n",
    "*情感分析*（sentiment analysis）研究人们在文本中\n",
    "（如产品评论、博客评论和论坛讨论等）“隐藏”的情绪。\n",
    "它在广泛应用于政治（如公众对政策的情绪分析）、\n",
    "金融（如市场情绪分析）和营销（如产品研究和品牌管理）等领域。\n",
    "\n",
    "由于情感可以被分类为离散的极性或尺度（例如，积极的和消极的），我们可以将情感分析看作一项文本分类任务，它将可变长度的文本序列转换为固定长度的文本类别。在本章中，我们将使用斯坦福大学的[大型电影评论数据集（large movie review dataset）](https://ai.stanford.edu/~amaas/data/sentiment/)进行情感分析。它由一个训练集和一个测试集组成，其中包含从IMDb下载的25000个电影评论。在这两个数据集中，“积极”和“消极”标签的数量相同，表示不同的情感极性。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9bb8e4c2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.append('..')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9ffc55a6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from d2l import mindspore as d2l\n",
    "from mindspore import nn, ops\n",
    "import numpy as np\n",
    "import os"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4b97e71",
   "metadata": {},
   "source": [
    "##  读取数据集\n",
    "\n",
    "首先，下载并提取路径`../data/aclImdb`中的IMDb评论数据集。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa23a672",
   "metadata": {},
   "outputs": [],
   "source": [
    "d2l.DATA_HUB['aclImdb'] = (\n",
    "    'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',\n",
    "    '01ada507287d82875905620988597833ad4e0903')\n",
    "\n",
    "data_dir = d2l.download_extract('aclImdb', 'aclImdb')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6809a37c",
   "metadata": {},
   "source": [
    "接下来，读取训练和测试数据集。每个样本都是一个评论及其标签：1表示“积极”，0表示“消极”。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "76cf7444",
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_imdb(data_dir, is_train):\n",
    "    \"\"\"读取IMDb评论数据集文本序列和标签\"\"\"\n",
    "    data, labels = [], []\n",
    "    for label in ('pos', 'neg'):\n",
    "        folder_name = os.path.join(data_dir, 'train' if is_train else 'test',\n",
    "                                   label)\n",
    "        for file in os.listdir(folder_name):\n",
    "            with open(os.path.join(folder_name, file), 'rb') as f:\n",
    "                review = f.read().decode('utf-8').replace('\\n', '')\n",
    "                data.append(review)\n",
    "                labels.append(1 if label == 'pos' else 0)\n",
    "    return data, labels\n",
    "\n",
    "train_data = read_imdb(data_dir, is_train=True)\n",
    "print('训练集数目：', len(train_data[0]))\n",
    "for x, y in zip(train_data[0][:3], train_data[1][:3]):\n",
    "    print('标签：', y, 'review:', x[0:60])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cbeb835d",
   "metadata": {},
   "source": [
    "## 预处理数据集\n",
    "\n",
    "将每个单词作为一个词元，过滤掉出现不到5次的单词，我们从训练数据集中创建一个词表。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34809770",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_tokens = d2l.tokenize(train_data[0], token='word')\n",
    "vocab = d2l.Vocab(train_tokens, min_freq=5, reserved_tokens=['<pad>'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60e693ee",
   "metadata": {},
   "source": [
    "在词元化之后，让我们绘制评论词元长度的直方图。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e2b8a0c",
   "metadata": {},
   "outputs": [],
   "source": [
    "d2l.set_figsize()\n",
    "d2l.plt.xlabel('# tokens per review')\n",
    "d2l.plt.ylabel('count')\n",
    "d2l.plt.hist([len(line) for line in train_tokens], bins=range(0, 1000, 50));"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15af2370",
   "metadata": {},
   "source": [
    "正如我们所料，评论的长度各不相同。为了每次处理一小批量这样的评论，我们通过截断和填充将每个评论的长度设置为500。这类似于 :numref:`sec_machine_translation`中对机器翻译数据集的预处理步骤。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a9d5d2e",
   "metadata": {},
   "outputs": [],
   "source": [
    "num_steps = 500  # 序列长度\n",
    "train_features = np.array([d2l.truncate_pad(\n",
    "    vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])\n",
    "print(train_features.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf75ca67",
   "metadata": {},
   "source": [
    "## 创建数据迭代器\n",
    "\n",
    "现在我们可以创建数据迭代器了。在每次迭代中，都会返回一小批量样本。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d5c0c52",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_iter = d2l.load_array((train_features, train_data[1]), 64)\n",
    "\n",
    "for X, y in train_iter.create_tuple_iterator():\n",
    "    print('X:', X.shape, ', y:', y.shape)\n",
    "    break\n",
    "print('小批量数目：', train_iter.get_dataset_size())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15d8dea2",
   "metadata": {},
   "source": [
    "## 整合代码\n",
    "\n",
    "最后，我们将上述步骤封装到`load_data_imdb`函数中。它返回训练和测试数据迭代器以及IMDb评论数据集的词表。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9c87961",
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_data_imdb(batch_size, num_steps=500):\n",
    "    \"\"\"返回数据迭代器和IMDb评论数据集的词表\"\"\"\n",
    "    data_dir = d2l.download_extract('aclImdb', 'aclImdb')\n",
    "    train_data = read_imdb(data_dir, True)\n",
    "    test_data = read_imdb(data_dir, False)\n",
    "    train_tokens = d2l.tokenize(train_data[0], token='word')\n",
    "    test_tokens = d2l.tokenize(test_data[0], token='word')\n",
    "    vocab = d2l.Vocab(train_tokens, min_freq=5)\n",
    "    train_features = [d2l.truncate_pad(\n",
    "        vocab[line], num_steps, vocab['<pad>']) for line in train_tokens]\n",
    "    test_features = [d2l.truncate_pad(\n",
    "        vocab[line], num_steps, vocab['<pad>']) for line in test_tokens]\n",
    "    train_iter = load_array((train_features, train_data[1]), batch_size)\n",
    "    test_iter = load_array((test_features, test_data[1]),\n",
    "                            batch_size,\n",
    "                            is_train=False)\n",
    "    return train_iter, test_iter, vocab"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "638919ff",
   "metadata": {},
   "source": [
    "## 小结\n",
    "\n",
    "* 情感分析研究人们在文本中的情感，这被认为是一个文本分类问题，它将可变长度的文本序列进行转换转换为固定长度的文本类别。\n",
    "* 经过预处理后，我们可以使用词表将IMDb评论数据集加载到数据迭代器中。\n",
    "\n",
    "## 练习\n",
    "\n",
    "1. 我们可以修改本节中的哪些超参数来加速训练情感分析模型？\n",
    "1. 请实现一个函数来将[Amazon reviews](https://snap.stanford.edu/data/web-Amazon.html)的数据集加载到数据迭代器中进行情感分析。\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
